Skip to content

Add count-trailing-zero libcalls #597

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

calc84maniac
Copy link
Contributor

I've implemented optimized count-trailing-zero libcalls, and I have an ez80-clang commit ready to push to make it use them. They've also been added to ez80_builtin implementations where applicable, and tested via ez80_builtin/stdbit tests.

Comment on lines +20 to +32
tst a, 030h
jr z, .high2
dec a
and a, 014h
ret po
ld a, 5
ret
.high2:
add a, a
sbc a, -8
ret p
ld a, 6
ret
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unfortunately breaks the optimization of the first trailing one functions relying on this to produce an output of 8 for an input of 0, but it might be worth it?

Suggested change
tst a, 030h
jr z, .high2
dec a
and a, 014h
ret po
ld a, 5
ret
.high2:
add a, a
sbc a, -8
ret p
ld a, 6
ret
add a, a
add a, a
jr z, .high2
add a, a
add a, a
sbc a, -5
ret
.high2:
sbc a, -7
ret

Copy link
Member

@runer112 runer112 Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this change would also hurt for maybe eventually implementing std::countr_zero, wouldn't it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, that assumption is also made by the compiler (this libcall implements CTTZ which is output by __builtin_ctz intrinsics on the Z80 target and not CTTZ_ZERO_UNDEF). Good optimization idea though, I'll have to think whether I can do something similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this change would also hurt for maybe eventually implementing std::countr_zero, wouldn't it...

It's actually already been implemented in the toolchain recently, and makes use of the Z80 intrinsic behavior I mentioned.

Copy link
Member

@runer112 runer112 Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that, except for that pesky 0 case, you could potentially rework this whole routine to be based on branches after pairs of left shifts. I think whether that would be faster or not depends on the input distribution, but it would certainly be smaller.

Comment on lines +37 to +38
ld a, l
or a, a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matching __icttz's first 3 bytes might compress better.

Suggested change
ld a, l
or a, a
xor a, a
or a, l

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I always forget to think about compression alongside other size optimizations.

@runer112 runer112 added the libc label Apr 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants